Abstract 

We describe a model that enables us to analyze the running time of an algorithm in a computer with 
a memory hierarchy with limited associativity, in terms of various cache parameters. Our model, an 
extension of Aggarwal and Vitter's I/O model, enables us to establish useful relationships between the 
cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-optimal algorithms 
for some fundamental problems like sorting, FFT, and an important subclass of permutations 
in the single-level cache model. We also show that ignoring associativity concerns could lead to inferior 
performance, by analyzing the average-case cache behavior of mergesort. We further extend our 
model to multiple levels of cache with limited associativity and present optimal algorithms for matrix 
transpose and sorting. Our techniques may be used for systematic exploitation of the memory hierarchy 
starting from the algorithm design stage, and dealing with the hitherto unresolved problem of limited 
associativity. 



1 Introduction 

Models of computation are essential for abstracting the complexity of real machines and enabling the design 
and analysis of algorithms. The widely-used RAM model owes its longevity and usefulness to its simplicity 
and robustness. While it is far removed from the complexities of any physical computing device, it 
successfully predicts the relative performance of algorithms based on an abstract notion of operation counts. 

The RAM model assumes a flat memory address space with unit-cost access to any memory location. 
With the increasing use of caches in modern machines, this assumption grows less justifiable. On modern 
computers, the running time of a program is as much a function of operation count as of its cache reference 
pattern. A result of this growing divergence between model and reality is that operation count alone is 
not always a true predictor of the running time of a program. The divergence manifests itself in anomalies such as a 
matrix multiplication algorithm demonstrating $O(n^5)$ running time instead of the expected $O(n^3)$ behavior 
predicted by the RAM model [§]. Such shortcomings of the RAM model motivate us to seek an alternative 
model that more realistically models the presence of a memory hierarchy. In this paper, we address the issue 
of better and systematic utilization of caches starting from the algorithm design stage. 

A challenge in coming up with a good model is achieving a balance between abstraction and fidelity, so 
as not to make the model unwieldy for theoretical analysis or simplistic to the point of lack of predictiveness. 
Memory hierarchy models used by computer architects to design caches have numerous parameters and 
suffer from the first shortcoming [[[], ^6|]. Early algorithmic work in this area focussed on a two-layered 
memory model [21]: a very large capacity memory with slow access time (secondary memory) and a limited 
size faster memory (internal memory). All computation is performed on elements in the internal memory 
and there is no restriction on placement of elements in the internal memory (fully associative). 

The focus of this paper is on the interaction between main memory and cache, which is the first level 
of memory hierarchy once the address is provided by the CPU. The structure of a single level hierarchy of 
cache memory is adequately characterized by the following three parameters: Associativity, Block size, and 
Capacity. Capacity and block size are in units of the minimum memory access size (usually one byte). A 
cache can hold a maximum of C bytes. However, due to physical constraints, the cache is divided into cache 
frames of size B that contain B contiguous bytes of memory — called a memory block. The associativity A 
specifies the number of different frames in which a memory block can reside. If a block can reside in any 
frame (i.e., A = C/B), the cache is said to be fully associative; if A = 1, the cache is said to be direct-mapped; 
otherwise, the cache is A-way set associative. 
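The three parameters can be made concrete with a small sketch. It assumes a modulo indexing function on the block number, which is only one common choice; the function names are illustrative and not part of the model.

```python
def cache_geometry(C, B, A):
    """Return (frames, sets) for capacity C, block size B and associativity A."""
    frames = C // B        # number of cache frames
    sets = frames // A     # each set groups A frames
    return frames, sets


def block_and_set(addr, C, B, A):
    """Map a byte address to (memory block number, cache set index)."""
    _, sets = cache_geometry(C, B, A)
    block = addr // B      # memory block containing the address
    # Assumption: the indexing function is a simple modulo on the block number.
    return block, block % sets


# A 32 KiB cache with 64-byte blocks has 512 frames.
print(cache_geometry(32 * 1024, 64, 1))     # direct-mapped: 512 sets
print(cache_geometry(32 * 1024, 64, 512))   # fully associative: 1 set
print(block_and_set(0x12345, 32 * 1024, 64, 4))
```

With A = C/B there is a single set, so the indexing function no longer constrains placement.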

For a given memory access, the hardware inspects the cache to determine if the corresponding memory 
element is resident in the cache. This is accomplished by using an indexing function to locate the appropriate 
set of cache frames that may contain the memory block. If the memory block is not resident, a cache miss is 
said to occur. From an architectural standpoint, cache misses can be classified into one of three classes [20]. 



• A compulsory miss (also called a cold miss) is one that is caused by referencing a previously unreferenced 
memory block. Eliminating a compulsory miss requires prefetching the data, either by an 
explicit prefetch operation or by placing more data items in a single memory block. 

• A reference that is not a compulsory miss but misses in a fully-associative cache with LRU replacement 
is classified as a capacity miss. Capacity misses are caused by referencing more memory blocks 
than can fit in the cache. Restructuring the program to re-use blocks while they are in cache can reduce 
capacity misses. 

• A reference that is not a compulsory miss that hits in a fully-associative cache but misses in an A-way 
set-associative cache is classified as a conflict miss. A conflict miss to block X indicates that block X 
has been referenced in the recent past, since it is contained in the fully-associative cache, but at least 
A other memory blocks that map to the same cache set have been accessed since the last reference to 
block X. Eliminating conflict misses requires transforming the program to change either the memory 
allocation or layout of the arrays involved (so that contemporaneous accesses do not compete for the 
same sets) or the manner in which the arrays are accessed. 
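The classification above can be sketched as a small trace simulator. This is an illustration under stated assumptions (modulo set indexing, LRU replacement in both the set-associative cache and the fully-associative reference cache); the function name and the trace format (a list of block numbers) are hypothetical.

```python
from collections import OrderedDict


def classify_misses(trace, num_frames, assoc):
    """Classify each block reference against an assoc-way LRU cache."""
    num_sets = num_frames // assoc
    seen = set()                 # blocks referenced at least once
    full = OrderedDict()         # fully-associative LRU cache of equal capacity
    sets = [OrderedDict() for _ in range(num_sets)]
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for blk in trace:
        in_full = blk in full
        s = sets[blk % num_sets]     # assumed modulo indexing
        if blk in s:
            kind = "hit"
        elif blk not in seen:
            kind = "compulsory"      # first reference to this block
        elif not in_full:
            kind = "capacity"        # would miss even with full associativity
        else:
            kind = "conflict"        # misses only because of limited associativity
        counts[kind] += 1
        seen.add(blk)

        # Update both caches with LRU replacement.
        for cache, limit in ((full, num_frames), (s, assoc)):
            if blk in cache:
                cache.move_to_end(blk)
            else:
                if len(cache) >= limit:
                    cache.popitem(last=False)
                cache[blk] = True
    return counts


# Blocks 0 and 4 share a set in a 4-frame direct-mapped cache.
print(classify_misses([0, 4, 0, 4], num_frames=4, assoc=1))
```

In the example the last two references are conflict misses in the direct-mapped cache, while the same trace hits in a fully-associative cache of the same capacity.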

Conflict misses pose an additional challenge in designing efficient algorithms in the cache. This class of 
misses is not present in the I/O models, where the mapping between internal and external memory is fully 
associative. 

Existing memory hierarchy models [Q, ||, ^ do not model certain salient features of caches, notably 
the lack of full associativity in address mapping and the lack of explicit control over data movement and 
replacement. Unfortunately, these small differences are malign in effect.¹ The conflict misses that they 
introduce make analysis of algorithms much more difficult [16]. Carter and Gatlin [gp conclude a recent 
paper saying 

What is needed next is a study of "messy details" not modeled by UMH (particularly cache associativity) 
that are important to the performance of the remaining steps of the FFT algorithm. 

1 See the discussion in ||9|] on a simple matrix transpose program. 

In the first part of this paper, we develop a two-level memory hierarchy model to capture the interaction 
between cache and main memory. Our model is a simple extension of the two-level I/O model that Aggarwal 
and Vitter [Q] proposed for analyzing external memory algorithms. However, it captures three additional 
constraints of caches: lower miss penalties; lack of full associativity in address mapping; and lack of explicit 
program control over data movement. The work in this paper shows that the constraint imposed by limited 
associativity can be tackled quite elegantly, allowing us to extend the results of the I/O model to the cache 
model very efficiently. 

Most modern architectures have a memory hierarchy consisting of multiple cache levels. In the second 
half of this paper, we extend the two-level cache model to a multi-level cache model. 

The remainder of this paper is organized as follows. Section 2 surveys related work. Section 3 defines 
our cache model and establishes an efficient emulation scheme between the I/O model and our cache model. 
As direct corollaries of the emulation scheme, we obtain cache-optimal algorithms for several fundamental 
problems such as sorting, FFT, and an important class of permutations. Section 4 illustrates the importance 
of the emulation scheme by demonstrating that a direct (i.e., bypassing the emulation) implementation of 
an I/O-optimal sorting algorithm (multiway mergesort) is provably inferior, even in the average case, in the 
cache model. Section 5 describes a natural extension of our model to multiple levels of caches. We present 
an algorithm for transposing a matrix in the multi-level cache model that attains optimal performance in the 
presence of any number of levels of cache memory. Our algorithm is not cache-oblivious, i.e., we do make 
explicit use of the sizes of the cache at various levels. Next, we show that with some simple modifications, 
the funnel-sort algorithm of Frigo et al. attains optimal performance in a single level (direct mapped) cache 
in an oblivious sense, i.e., without prior knowledge of memory parameters. Finally, the closing section presents 
conclusions, possible refinements to the model, and directions for future work. 



2 Related work 

The I/O model assumes that most of the data resides on disk and has to be transferred to main memory to do 
any processing. Because of the tremendous difference in speeds, it ignores the cost of internal processing 
and counts only the number of I/Os. Floyd [15] originally defined a formal model and proved tight bounds on 
the number of I/Os required to transpose a matrix using two pages of internal memory. Hong and Kung [21] 
extended this model and studied the I/O complexity of FFT when the internal memory size is bounded by 
M. Aggarwal and Vitter [Q] further refined the model by incorporating an additional parameter B, the 
number of (contiguous) elements transferred in a single I/O operation. They gave upper and lower bounds 
on the number of I/Os for several fundamental problems including sorting, selection, matrix transposition, 
and FFT. Following their work, researchers have designed I/O-optimal algorithms for fundamental problems 
in graph theory [13] and computational geometry [19]. 



Researchers have also modeled multiple levels of memory hierarchy. Aggarwal et al. [g|] defined the 
Hierarchical Memory Model (HMM) that assigns a function f(x) to accessing location x in the memory, 
where f is a monotonically increasing function. This can be regarded as a continuous analog of the multi-level 
hierarchy. Aggarwal et al. [||] added the capability of block transfer to the HMM, which enabled them 
to obtain faster algorithms. Alpern et al. []|] described the Uniform Memory Hierarchy (UMH) model, where 
the access costs differ in discrete steps. Very recently, Frigo et al. [18] presented an alternate strategy of 
algorithm design on these models which has the added advantage that explicit values of parameters related 
to different levels of the memory hierarchy are not required. Bilardi and Peserico M\ investigate further the 
complexity of designing algorithms without knowledge of the architectural parameters.² Other attempts were 
directed towards extracting better performance by parallel memory hierarchies [32, 33, 14], where several 
blocks could be transferred simultaneously. 

Ladner et al. [23] describe a stochastic model for performance analysis in cache. Our work is different 
in nature, as we follow a more traditional worst-case analysis. Our analysis of sorting in Section 4 provides 
a better theoretical basis for some of the experimental work of LaMarca and Ladner [25]. 

To the best of our knowledge, the only other paper that addresses the problem of limited associativity in 
cache is recent work of Mehlhorn and Sanders [27]. They show that for a class of algorithms based on merging 
multiple sequences, the I/O algorithms can be made nearly optimal by use of a simple randomized shift 
technique. The emulation theorem in Section 3 of this paper not only provides a deterministic solution for 
the same class of algorithms, but also works for a very general situation. The results in [27] are nevertheless 
interesting from the perspective of implementation. 



3 The cache model 

The (two-level) I/O model of Aggarwal and Vitter [|[] captures the interaction between a slow (secondary) 
memory of infinite capacity and a fast (primary) memory of limited capacity. It is characterized by two 
parameters: M, the capacity of the fast memory; and B, the size of data transfers between slow and fast 
memories. Such data movement operations are called I/O operations or block transfers. The use of the 
model is meaningful when the problem size $N \gg M$. 

The I/O model contains the following further assumptions. 

1. A datum can be used in a computation iff it is present in fast memory. All data initially resides in 
slow memory. Data can be transferred between slow and fast memory (in either direction) by I/O 
operations. 

2. Since the latency for accessing slow memory is very high, the average cost of transfer per element can 
be reduced by transferring a block of B elements at little additional cost. This may not be as useful as 
it may seem at first sight, since these B elements are not arbitrary, but are contiguous in memory. The 
onus is on the programmer to use all the elements, as traditional RAM algorithms are not designed 
for such restricted memory access patterns. We denote the map from a memory address to its block 
address by B. The internal memory can hold at least three blocks, i.e., $M \geq 3B$. 

3. The computation cost is ignored in comparison to the cost of an I/O operation. This is justified by the 
high access latency of slow memory. 

4. A block of data from slow memory can be placed in any block of fast memory. 

5. I/O operations are explicit in the algorithm. 

The goal of algorithm design in this model is to minimize the number of I/O operations. 

We adopt much of the framework of the I/O model in developing a cache model to capture the interac- 
tions between cache and main memory. In this case, the cache is the fast memory, while main memory is 
the slow memory. Assumptions 1 and 2 of the I/O model continue to hold in our cache model. However, 
assumptions 3-5 are no longer valid and need to be replaced as follows. 

• The difference between the access times of slow and fast memory is considerably smaller than in 
the I/O model, namely a factor of 5-100 rather than a factor of 10000. We will use L to denote the 
normalized cache latency. This cost function assigns a cost of 1 for accessing an element in cache and 
L for accessing an element in the main memory. This way, we also account for the computation in 
cache. 

2 However, none of these models addresses the problem of limited associativity in cache. 

• Main memory blocks are mapped into cache sets using a fixed and pre-determined mapping function 
that is implemented in hardware. Typically, this is a modulo mapping based on the low-order address 
bits. However, the results of this section will hold as long as there is a fixed address mapping function 
that distributes the main memory evenly in the cache. We denote this mapping from main memory 
blocks to cache sets by S. We will occasionally slightly abuse this notation and apply S directly to a 
memory address x rather than to B(x). 

• The cache is not visible to the programmer (not even at the assembly level). When a program issues 
a reference to a memory location x, an image (copy) of the main memory block b = B(x) is brought 
into the cache set S(b) if it is not already present there. The block b continues to reside in cache until 
it is evicted by another block b' that is mapped to the same cache set (i.e., S(b) = S(b')). In other 
words, a cache set c contains the latest memory block referenced that is mapped to this set. 

To summarize, we use the notation $\mathcal{C}(M, B, L)$ to denote our three-parameter cache model, and the 
notation $\mathcal{I}(M, B)$ to denote the I/O model with parameters M and B. We will use n and m to denote N/B 
and M/B respectively. The assumptions of our cache model parallel those of the I/O model, except as noted 
above.³ The goal of algorithm design in the cache model is to minimize running time, defined as the number 
of cache accesses plus L times the number of main memory accesses. 
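The cost measure of the direct-mapped model can be illustrated by a short simulation. This is a sketch under two assumptions not fixed by the text: the mapping S is a modulo on the block number, and a miss is charged L only (rather than L + 1). The function name is illustrative.

```python
def running_time(addresses, M, B, L):
    """Cost of an address trace: 1 per cache hit, L per main memory access."""
    num_sets = M // B
    resident = [None] * num_sets   # block currently held by each cache set
    cost = 0
    for x in addresses:
        b = x // B                 # block address B(x)
        s = b % num_sets           # cache set S(b), assumed modulo mapping
        if resident[s] == b:
            cost += 1
        else:
            resident[s] = b        # the new block evicts the previous occupant
            cost += L
    return cost


# A sequential scan touches each block once: 8 misses and 56 hits.
print(running_time(range(64), M=32, B=8, L=10))      # 136
# Two addresses that map to the same set evict each other on every access.
print(running_time([0, 32, 0, 32], M=32, B=8, L=10)) # 40
```

The second trace shows how a reference pattern with only two distinct blocks can still pay L on every access when both blocks share a cache set.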

3.1 Emulating I/O algorithms 

The differences between the two models listed above would appear to frustrate any efforts to naively map an 
I/O algorithm to the cache model, given that we neither have the control nor the flexibility of the I/O model. 
Our main result in this section establishes a connection between the I/O model and the cache model using a 
very simple emulation scheme. 

Theorem 3.1 (Emulation Theorem) An algorithm A in $\mathcal{I}(M, B)$ using T block transfers and I processing 
time can be converted to an equivalent algorithm $A_c$ in $\mathcal{C}(M, B, L)$ that runs in $O(I + (L + B) \cdot T)$ steps. 
The memory requirement of $A_c$ is an additional m + 2 blocks beyond that of A. 

Proof: Note that I is usually not accounted for in the I/O model, but we will keep track of the internal 
memory computation done in A in our emulation. The idea behind the emulation is as follows. We will 
mimic the behavior of the I/O algorithm A in the cache model, using an array Buf of m blocks to play the 
role of the fast memory. We will view the main memory in the cache model as an array Mem of B-element 
blocks. Although Buf is also part of the memory, we are using different notations to make their roles explicit 
in this proof. Likewise, we will view the cache as an array of sets and denote the ith set by C[i]. 

As discussed above, we do not have explicit control on the contents of the cache locations. However, we 
can control the memory access pattern through a level of indirection so as to maintain a 1-1 correspondence 
between Buf and the cache. Without loss of generality, we assume that S maps block i of Buf to cache set C[i] for $i \in [1, m]$. 

We divide the I/O algorithm into rounds, where in each round, the I/O algorithm A transfers a block 
between the slow memory and the fast memory and (possibly) does some computations. The cache algorithm 
$A_c$ transfers the same blocks between Mem and Buf and then does the identical computations in Buf. Figure 
1 formally describes the procedure. Note that the B elements must be explicitly copied in the cache model. 

3 Frigo et al. [18] independently arrive at a very similar parameterization of their model. 

Round t of the emulation 

I/O Algorithm A                          | Cache Emulation $A_c$ 
-----------------------------------------+------------------------------------------ 
1. Transfer block $b_t$ from slow memory | 1. Copy contents of the B locations of 
   to block $a_t$ of the fast memory     |    Mem[$b_t$] into Buf[$a_t$] 
2. Perform computations in fast memory   | 2. Perform identical computations in Buf 

Figure 1: The emulation scheme used in the proof of Theorem 3.1 



It should be clear that the final outcome of algorithm $A_c$ is the same as that of algorithm A. The more interesting 
issue is the cost of the emulation. 

A block of size B is transferred into cache if its image does not exist in the cache at the time of reference. 
The invariant that we try to maintain at the end of each round is that there is a 1-1 correspondence between 
Buf and C. This will ensure that all the I operations are done within the cache at minimal cost. 

Assume that we have maintained the above invariant at the end of round $t - 1$. In round t, we transfer 
block Mem[$b_t$] into Buf[$a_t$]. Accessing the memory block Mem[$b_t$] will displace the existing block in cache 
set C[q], where $q = S(b_t)$. From the invariant, we know that the block displaced from C[q] is Buf[q], 
which must be brought back into cache to restore the invariant. We can bring it back by a single memory reference 
and charge its cost, L, to round t itself. (Actually it will be brought back during the subsequent 
reference, so the previous step is only to simplify the accounting.) 

The cost of copying Mem[$b_t$] to Buf[$a_t$] is L + B assuming that Mem[$b_t$] and Buf[$a_t$] are not mapped to 
the same cache set ($S(b_t) \neq S(a_t)$). Otherwise it will cause alternate cache misses (thrashing) of the blocks 
Mem[$b_t$] and Buf[$a_t$], leading to $L \cdot B$ steps for copying. This can be prevented by transferring through an 
intermediate memory block Mem[Y] such that $S(Y) \neq S(b_t)$. Having two such intermediate buffers that 
map to distinct cache sets would suffice in all cases. So, we first transfer Mem[$b_t$] to Mem[Y] followed by 
Mem[Y] to Buf[$a_t$]. The first copying has cost 2L + B since both blocks must be fetched from main memory. 
The second transfer is between blocks, one of which is present in the cache, so it has cost L + B. To this we 
must also add cost L for restoring the block of Buf that was mapped to the same cache set as Mem[Y]. So, 
the total cost of the safe method is 4L + 2B. 

The internal processing remains identical. If $I_t$ denotes the internal processing cost of step t, the total 
cost of the emulation is at most $\sum_{t=1}^{T} (I_t + 2(L + B) + 2L) = I + 4L \cdot T + 2B \cdot T$. □ 
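The thrashing case in the proof can be checked with a small simulation of an element-by-element block copy in a direct-mapped cache. This is a sketch, not the accounting used in the proof: it assumes modulo set indexing, starts each copy with neither block cached, and charges the read and the write of every element separately, so its constants differ from those above by up to a factor of two.

```python
def copy_cost(src_block, dst_block, B, num_sets, L):
    """Cost of copying B elements one at a time from src_block to dst_block."""
    resident = {}                        # cache set -> resident block
    cost = 0
    for _ in range(B):
        for blk in (src_block, dst_block):   # read one element, then write it
            s = blk % num_sets               # assumed modulo mapping
            if resident.get(s) == blk:
                cost += 1
            else:
                resident[s] = blk
                cost += L
    return cost


B, num_sets, L = 8, 4, 10
# Blocks 0 and 4 map to the same set: every access misses (thrashing).
print(copy_cost(0, 4, B, num_sets, L))                                # 160
# Routing the copy through block 1, which maps to a different set, avoids it.
print(copy_cost(0, 1, B, num_sets, L) + copy_cost(1, 4, B, num_sets, L))  # 68
```

The direct copy grows as L times B, while the two-step copy grows as a multiple of L + B, matching the qualitative gap the proof relies on.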

Remark 1 

• A possible alternative to using intermediate memory-resident buffers to avoid thrashing is to use 
registers, since register access is much faster. In particular, if we have B registers, then we can save 
two extra memory accesses, bringing down the emulation cost to 2L + 2B. 

• We can make the emulation somewhat simpler by using a randomized mapping scheme. That is, if 
we choose the starting location of array Buf randomly, then the probability that Mem[$b_t$] and Buf[$a_t$] 
have the same image is 1/M. So the expected emulation cost is $I + 2L \cdot T + (B + (LB)/M) \cdot T$ 
without using any intermediate copying. 

• The basic idea of copying data into contiguous memory locations to reduce interference misses has 
been exploited before in some specific contexts like matrix multiplication [24] and bit-reversal permutation 
[ph. Theorem 3.1 unifies these previous results within a common framework. 






The term $O(B \cdot T)$ is subsumed by $O(I)$ if computation is done on at least a constant fraction of the 
elements in the block transferred by the I/O algorithm. This is usually the case for efficient I/O algorithms. 
We will call such I/O algorithms block-efficient. 

Corollary 3.2 A block-efficient I/O algorithm for $\mathcal{I}(M, B)$ that uses T block transfers and I processing can 
be emulated in $\mathcal{C}(M, B, L)$ in $O(I + L \cdot T)$ steps. 



Remark 2 The algorithms for sorting, FFT, matrix transposition, and matrix multiplication described in 
Aggarwal and Vitter [Q] are block-efficient. 



3.2 Extension to set-associative cache 

The trend in modern memory architectures is to allow limited flexibility in the address mapping between 
memory blocks and cache sets. The k-way set-associative cache has the property that a memory block 
can reside in any (one) of k cache frames. Thus, k = 1 corresponds to the direct-mapped cache we have 
considered so far, while k = m corresponds to a fully associative cache. Values of k for data caches are 
generally small, usually in the range 1-4. 

If all the k frames of the set are occupied, a replacement policy like LRU is used (by the hardware) to find an assignment 
for the referenced block. The emulation technique of the previous section would extend to this 
scenario easily if we had explicit control on the replacement. This not being the case, we shall tackle it indirectly 
by making use of a useful property of LRU that Frigo et al. [18] exploited in the context of designing 
cache-oblivious algorithms for a fully associative cache. 

Lemma 3.1 (Sleator-Tarjan[[3(J) For any sequence s, $F_{LRU}$, the number of misses incurred by LRU with 
cache size $n_{LRU}$, is no more than $\frac{n_{LRU}}{n_{LRU} - n_{OPT} + 1} F_{OPT}$, where $F_{OPT}$ is the minimum number 
of misses by an optimal replacement strategy with cache size $n_{OPT}$. 

We use this lemma in the following way. We run the emulation technique for only half the cache size, i.e., 
we choose the buffer to be of size m/2, such that for every k cache lines in a set, we have only k/2 buffer 
blocks. From Lemma 3.1, we know that the number of misses in each cache set is no more than twice 
the optimal, which is in turn bounded by the number of misses incurred by the I/O algorithm. 
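The factor of two can be checked by direct substitution. The following worked step assumes k is even and treats each cache set as an LRU cache of $n_{LRU} = k$ frames competing against an optimal strategy that uses only $n_{OPT} = k/2$ frames of that set.

```latex
\[
F_{LRU} \;\le\; \frac{n_{LRU}}{n_{LRU} - n_{OPT} + 1}\, F_{OPT}
        \;=\; \frac{k}{k - k/2 + 1}\, F_{OPT}
        \;=\; \frac{2k}{k + 2}\, F_{OPT}
        \;<\; 2\, F_{OPT}.
\]
```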

Theorem 3.3 (Generalized Emulation Theorem) An algorithm A in $\mathcal{I}(M/2, B)$ using T block transfers 
and I processing time can be converted to an equivalent algorithm $A_c$ in the k-way set-associative cache 
model with parameters M, B, L that runs in $O(I + (L + B) \cdot T)$ steps. The memory requirement of $A_c$ is 
an additional m/2 + 2 blocks beyond that of A. 

3.3 The cache complexity of sorting and other problems 

Aggarwal and Vitter [Q] prove the following lower bound for sorting and FFT in the I/O model. 

Lemma 3.2 ([§]) The average-case and the worst-case number of I/Os required for sorting N records and 
for computing the N-input FFT graph in $\mathcal{I}(M, B)$ is $\Omega\left(\frac{N}{B} \cdot \frac{\log(1 + N/B)}{\log(1 + M/B)}\right)$. 



Theorem 3.4 The lower bound for sorting in $\mathcal{C}(M, B, L)$ is $\Omega\left(N \log N + L \cdot \frac{N}{B} \cdot \frac{\log(N/B)}{\log(M/B)}\right)$. 

Proof: Any lower bound on the number of block transfers in $\mathcal{I}(M, B)$ carries over to $\mathcal{C}(M, B, L)$. Since 
the lower bound is the maximum of the lower bound on number of comparisons and the bound in Lemma 
3.2, the theorem follows by dividing the sum of the two terms by 2. □ 

Theorem 3.5 In $\mathcal{C}(M, B, L)$, N numbers can be sorted in $O\left(N \log N + L \cdot \frac{N}{B} \cdot \frac{\log(N/B)}{\log(M/B)}\right)$ steps and this is 
optimal. 

Proof: The M/B-way mergesort algorithm described in Aggarwal and Vitter [Q] has an I/O complexity of 
$O\left(\frac{N}{B} \cdot \frac{\log(N/B)}{\log(M/B)}\right)$. The processing time involves maintaining a heap of size M/B and $O(\log M/B)$ per output 
element. For N elements, the number of phases is $\frac{\log N}{\log M/B}$, so the total processing time is $O(N \log N)$. From 
Corollary 3.2 and Remark 2, the cost of this algorithm in the cache model is $O\left(N \log N + L \cdot \frac{N}{B} \cdot \frac{\log(N/B)}{\log(M/B)}\right)$. 
Optimality follows from Theorem 3.4. □ 
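The heap-based multiway merge behind the processing-time bound can be sketched as follows. This is an in-memory illustration only: the parameter k plays the role of M/B, block transfers are not modeled, and the function names are illustrative rather than taken from Aggarwal and Vitter.

```python
import heapq


def multiway_merge(runs):
    """Merge sorted runs using a heap that holds one leading element per run."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, j = heapq.heappop(heap)    # O(log k) work per output element
        out.append(value)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out


def multiway_mergesort(data, k):
    """Sort by repeated k-way merging (k >= 2); returns (sorted list, phases)."""
    runs = [[x] for x in data]               # initial runs of length 1
    phases = 0
    while len(runs) > 1:
        runs = [multiway_merge(runs[i:i + k]) for i in range(0, len(runs), k)]
        phases += 1
    return (runs[0] if runs else []), phases


print(multiway_mergesort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0], k=3))
```

For the ten-element example with k = 3 the run count shrinks from 10 to 4 to 2 to 1, so three merge phases are used, in line with the logarithmic phase count in the proof.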



Remark 3 The M/B-way distribution sort (multiway quicksort) also has the same upper bound. 
We can prove a similar result for FFT computations. 

Theorem 3.6 The FFT of N numbers can be computed in $O\left(N \log N + L \cdot \frac{N}{B} \cdot \frac{\log(N/B)}{\log(M/B)}\right)$ steps in $\mathcal{C}(M, B, L)$. 



Remark 4 The FFTW algorithm [17] is optimal only for B = 1. Barve |Q] has independently obtained a 
similar result. 

The class of Bit Matrix Multiply Complement (BMMC) permutations includes many important permutations 
like matrix transposition and bit reversal. Combining the work of Cormen et al. [14] with our emulation 
scheme, we obtain the following result. 

Theorem 3.7 The class of BMMC permutations for N elements can be achieved in ( N + L ■ ^ i^mJW) ) 
steps in C(M, B, L). 



Remark 5 Many known geometric [13] and graph algorithms [19] in the I/O model, such as convex hull 
and graph connectivity, can be transformed optimally into the cache model. 

4 Average-case performance of mergesort in the cache model 

In this section, we analyze the average-case performance of k-way mergesort in the cache model. Of the 
three classes of misses described in Section 1, we note that compulsory misses are unavoidable and that 
capacity misses are minimized while designing algorithms for the I/O model. We are therefore interested 
in bounding the number of conflict misses for a straightforward implementation of the I/O-optimal k-way 
mergesort algorithm. It is easy to construct a worst-case input permutation where there will be a conflict 
miss for every input element (a cyclic distribution suffices), so the average case is more interesting. 

We assume that s cache sets are available for the leading blocks of the k runs $S_1, \ldots, S_k$. In other words, 
we ignore the misses caused by heap operations (or equivalently ensure that the heap area in the cache does 
not overlap with the runs). 

We create a random instance of the input as follows. Consider the sequence $\{1, \ldots, N\}$, and distribute 
the elements of this sequence to runs by traversing the sequence in increasing order and assigning element i 
to run $S_j$ with probability 1/k. From the nature of our construction, each run $S_i$ is sorted. We denote the j-th 
element of $S_i$ as $S_{i,j}$. The expected number of elements in any run $S_i$ is N/k. 

During the k-way merge, the leading blocks are critical in the sense that the heap is built on the leading 
element of every sequence $S_i$. The leading element of a sequence is the smallest element that has not been 
added to the merged (output) sequence. The leading block is the cache line containing the leading element. 
Let $b_i$ denote the leading block of run $S_i$. Conflict can occur when the leading blocks of different sequences 
are mapped to the same cache set. In particular, a conflict miss occurs for element $S_{i,j+1}$ when there is at 
least one element $x \in S_k$ for some $k \neq i$, such that $S_{i,j} < x < S_{i,j+1}$ and $S(b_i) = S(b_k)$. (We do not count 
conflict misses for the first element in the leading block, i.e., $S_{i,j}$ and $S_{i,j+1}$ must belong to the same block, 
but we will not be very strict about this in our calculations.) 

Let $p_i$ denote the probability of conflict for element $i \in [1, N]$. Using indicator random variables $X_i$ to 
count the conflict miss for element i, the total number of conflict misses is $X = \sum_i X_i$. It follows that the 
expected number of conflict misses is $E[X] = \sum_i E[X_i] = \sum_i p_i$. In the remainder of this section we will 
estimate a lower bound on $p_i$ for i large enough to validate the following assumption. 

A1 The cache sets of the leading blocks, $S(b_i)$, are randomly distributed in cache sets 1, ..., s 
independent of the other sorted runs. Moreover, the exact position of the leading element within 
the leading block is also uniformly distributed in positions $\{1, \ldots, sB\}$. 
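The random instance and assumption A1 lend themselves to a quick Monte Carlo experiment. The sketch below is an illustration only: it assumes each run is stored contiguously, draws the cache set of each run's first block uniformly at random, ignores heap accesses, and skips the first element of every block as described above. The function name and the seed parameter are hypothetical.

```python
import random


def conflict_misses(N, k, s, B, seed=0):
    """Count conflict misses on leading blocks during a simulated k-way merge."""
    rng = random.Random(seed)
    start = [rng.randrange(s) for _ in range(k)]  # random set of each run's first block
    consumed = [0] * k            # elements already output from each run
    resident = [None] * s         # (run, block index) held by each cache set
    misses = 0
    for _ in range(N):
        r = rng.randrange(k)      # the next smallest element lies in a random run
        blk = consumed[r] // B    # leading block of run r
        cset = (start[r] + blk) % s
        if resident[cset] != (r, blk):
            if consumed[r] % B != 0:   # not the first element of its block
                misses += 1
            resident[cset] = (r, blk)
        consumed[r] += 1
    return misses


# Compare a merge degree equal to the number of sets with a much smaller one.
print(conflict_misses(N=20000, k=64, s=64, B=8))
print(conflict_misses(N=20000, k=4, s=64, B=8))
```

With a single run there is nothing to conflict with, so the count is zero; as k approaches s, the count is expected to become a constant fraction of N.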



Remark 6 A recent variation of the mergesort algorithm (see [Q]) actually satisfies A1 by its very nature. 
So, the present analysis is directly applicable to its average-case performance in cache. A similar observation 
was made independently by Sanders [27] who obtained upper-bounds for mergesort for a set associative 
cache. 

From our previous discussion and the definition of a conflict miss, we would like to compute the probability 
of the following event. 

E1 For some i, j, for all elements x such that $S_{i,j} < x < S_{i,j+1}$, $S(x) \neq S(S_{i,j})$. 

In other words, none of the leading blocks of the sorted sequences $S_j$, $j \neq i$, conflicts with $b_i$. The probability 
of the complement of this event (i.e., $\Pr[\overline{E1}]$) is the probability that we want to estimate. We will 
compute an upper bound on $\Pr[E1]$, under the assumption A1, thus deriving a lower bound on $\Pr[\overline{E1}]$. 

Lemma 4.1 For $k/s > \epsilon$, $\Pr[E1] < 1 - \delta$, where $\epsilon$ and $\delta$ are positive constants (dependent only on s and 
k but not on the problem size or on B). 

Proof: See Appendix A. □ 
Thus we can state the main result of this section as follows. 

Theorem 4.1 The expected number of conflict misses in a random input for doing a k-way merge in an 
s-set direct-mapped cache, where k is $\Omega(s)$, is $\Omega(N)$, where N is the total number of elements in all the 
k sequences. Therefore the (ordinary I/O-optimal) M/B-way mergesort in an M/B-set cache will exhibit 
$\Omega\left(N \cdot \frac{\log(N/B)}{\log(M/B)}\right)$ cache misses, which is asymptotically larger than the optimal value of $O\left(\frac{N}{B} \cdot \frac{\log(N/B)}{\log(M/B)}\right)$. 

Proof: The probability of conflict misses is $\Omega(1)$ when k is $\Omega(s)$. Therefore the expected total number of 
conflict misses is $\Omega(N)$ for N elements. The I/O-optimal mergesort uses M/B-way merging at each of the 
$\frac{\log(N/B)}{\log(M/B)}$ levels, hence the second part of the theorem follows. □ 

Remark 7 Intuitively, by choosing $k \ll s$, we can minimize the probability of conflict misses, but this results 
in an increased number of merge phases (and hence running time). This underlines the critical role of 
conflict misses vis-a-vis capacity misses that forces us to use only a small fraction of the available cache. 
Recently, Sanders [27] has shown that by choosing k to be 0( B ^ 1 / a ) in an a-way set associative cache with 



a modified version of mergesort of [0], the expected number of conflict misses per phase can be bounded by 
0(N/B). In comparision, the use of the emulation theorem guarantees minimal worst-case conflict misses 
while making good use of cache. 
5 The Multi-level Cache Model



Most modern architectures have a memory hierarchy consisting of multiple levels of cache. Consider two
cache levels $\mathcal{L}_1$ and $\mathcal{L}_2$ preceding main memory, with $\mathcal{L}_1$ being faster and smaller. The operation of the
memory hierarchy in this case is as follows. The memory location being referenced is first looked up in $\mathcal{L}_1$.
If it is not present in $\mathcal{L}_1$, then it is searched for in $\mathcal{L}_2$ (these can be overlapped with appropriate hardware
support). If the item is not present in $\mathcal{L}_1$ but it is in $\mathcal{L}_2$, then it is brought into $\mathcal{L}_1$. In case it is not in $\mathcal{L}_2$,
a cache line is brought in from main memory into $\mathcal{L}_2$ and into $\mathcal{L}_1$. The size of the cache line brought into
$\mathcal{L}_2$ (denoted by $B_2$) is usually no smaller than the one brought into $\mathcal{L}_1$ (denoted by $B_1$). The expectation is
that the more frequently used items will remain in the faster cache.

The Multi-level Cache Model is an extension to multiple cache levels of the previously introduced Cache
Model. Let $\mathcal{L}_i$ denote the i-th level of cache memory. The parameters involved here are the problem size
N, the size of $\mathcal{L}_i$, which is denoted by $M_i$, the frame size (unit of allocation) of $\mathcal{L}_i$, denoted by $B_i$, and the
latency factor $l_i$. If a data item is present in $\mathcal{L}_i$, then it is present in $\mathcal{L}_j$ for all $j \ge i$ (sometimes referred
to as the inclusion property). If it is not present in $\mathcal{L}_i$, then the cost for a miss is $l_i$ plus the cost of fetching
it from $\mathcal{L}_{i+1}$ (if it is present in $\mathcal{L}_{i+1}$, then this cost is zero). For convenience, the latency factor $l_i$ is the ratio
of time taken on a miss from the i-th level to the amount of time taken for a unit operation.
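
As a rough illustration of the cost rule above, the following Python sketch computes the miss cost charged to a single reference. The function name `access_cost` and its argument layout are illustrative assumptions, not notation from the model; levels are numbered from 1, and an item found in no cache level is taken to reside in main memory.

```python
def access_cost(latencies, hit_level):
    """Miss cost of one reference in the multi-level model.

    latencies[i - 1] is the latency factor l_i of cache level i.
    hit_level is the smallest level holding the item, or
    len(latencies) + 1 when the item is only in main memory.
    """
    if not 1 <= hit_level <= len(latencies) + 1:
        raise ValueError("hit_level out of range")
    # Every level below hit_level misses and contributes its latency l_i.
    return sum(latencies[: hit_level - 1])
```

For example, with latency factors (2, 10, 50), a reference that hits in the third level is charged 2 + 10, while one that goes to main memory is charged all three latencies.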

Figure 2 shows the memory mapping for a two-level cache architecture. The shaded part of main memory
is of size $B_1$ and therefore occupies only a part of a line of the $\mathcal{L}_2$ cache, which is of size $B_2$. There is a
natural generalization of the memory mapping to multiple levels of cache.

We make the following assumptions in this section, which are consistent with existing architectures. 

A1. For all i, $B_i$ and $L_i$ are powers of 2.

A2. $2B_i \le B_{i+1}$ and the number of cache lines satisfies $L_i \le L_{i+1}$.

A3. $B_k \le L_1$ and $4B_k \le B_1 L_1$ (i.e., $B_1 \ge 4$), where $\mathcal{L}_k$ is the largest and slowest cache. This
implies that

$$L_i \cdot B_i \ge B_k \cdot B_i \qquad (1)$$

This will be useful for the analysis of the algorithms; such a condition is sometimes termed a tall cache,
in reference to the aspect ratio.

Figure 2: Memory mapping in a two-level cache hierarchy

5.1 Matrix Transpose

In this section, we provide an approach for transposing a matrix in the Multi-level Cache Model.

The trivial lower bound for matrix transposition of an N x N matrix in the multi-level cache hierarchy
is clearly the time to scan $N^2$ elements, namely,

$$\Omega\left(\sum_{i=1}^{k} \frac{N^2}{B_i}\, l_i\right),$$

where $B_i$ is the number of elements in one cache line in $\mathcal{L}_i$ cache; $L_i$ is the number of cache lines in $\mathcal{L}_i$ cache,
which is $M_i/B_i$; and $l_i$ is the latency for $\mathcal{L}_i$ cache.

Our algorithm uses a more general form of the emulation theorem to get the submatrices to fit into cache
in a regular fashion. The work in this section shows that it is possible to handle the constraints imposed by
limited associativity even in a multi-level cache model.

We subdivide the matrix into $B_k \times B_k$ submatrices. Thus we get $\lceil n/B_k \rceil \times \lceil n/B_k \rceil$ submatrices from
an $n \times n$ matrix, where
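$$A = \begin{pmatrix} a_1 & \cdots & a_n \\ \vdots & & \vdots \\ a_{n^2-n+1} & \cdots & a_{n^2} \end{pmatrix}
= \begin{pmatrix} A_1 & A_2 & \cdots & A_{n/B} \\ \vdots & & & \vdots \\ A_{n^2/B^2 - n/B + 1} & \cdots & & A_{n^2/B^2} \end{pmatrix}$$

(here B stands for $B_k$). Note that the submatrices in the last row and column need not be square, as one
side may have fewer than B rows or columns.

Let m = n/B; then

$$A^T = \begin{pmatrix} A_1^T & A_{m+1}^T & \cdots & A_{m^2-m+1}^T \\ A_2^T & A_{m+2}^T & \cdots & \vdots \\ \vdots & & & \vdots \\ A_m^T & A_{2m}^T & \cdots & A_{m^2}^T \end{pmatrix}$$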

For simplicity, we describe the algorithm as transposing a square matrix A in another matrix B, i.e.,
$B = A^T$. The main procedure is Rec_Trans(A, B, s), where A is transposed into B by dividing A and B
into $s^2$ submatrices and then recursively transposing the submatrices. Let $A_{ij}$ ($B_{ij}$) denote the submatrices
for $1 \le i, j \le s$. Then $B = A^T$ can be computed as Rec_Trans($A_{ij}$, $B_{ji}$, s') for all i, j and some
appropriate s' which depends on $B_k$ and $B_{k-1}$. In general, if $t_k, t_{k-1}, \ldots, t_1$ denote the values of s' at the
1, 2, ... levels of recursion, then $t_i = B_{i+1}/B_i$. If the submatrices are $B_1 \times B_1$ (base case), then perform the
transpose exchange of the symmetric submatrices directly. We perform matrix transpose as follows, which
is similar to the familiar recursive transpose algorithm.

1. Subdivide the matrix as shown into $B_k \times B_k$ submatrices.

2. Move the symmetric submatrices to contiguous memory locations.

3. Rec_Trans($A_{ij}$, $B_{ji}$, $B_k/B_{k-1}$).

4. Write back the $B_k \times B_k$ submatrices to original locations.
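
The recursive subdivision in steps 1 and 3 can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the names `rec_trans` and `blocked_transpose` are hypothetical, the matrix side is assumed to be a multiple of $B_k$, each block size is assumed to divide the next, and the copying to and from contiguous locations (steps 2 and 4) is omitted, so the sketch shows the order of tile visits rather than the cache behavior.

```python
def rec_trans(a, b, r0, c0, level, block_sizes):
    """Write the transpose of one tile of `a` into `b`.

    The tile starts at (r0, c0) and has side block_sizes[level];
    block_sizes lists B_1, ..., B_k in increasing order.
    """
    size = block_sizes[level]
    if level == 0:
        # Base case: a B_1 x B_1 tile is transposed element by element.
        for r in range(r0, r0 + size):
            for c in range(c0, c0 + size):
                b[c][r] = a[r][c]
        return
    step = block_sizes[level - 1]
    # Split the tile into (size / step)^2 smaller tiles and recurse.
    for r in range(r0, r0 + size, step):
        for c in range(c0, c0 + size, step):
            rec_trans(a, b, r, c, level - 1, block_sizes)


def blocked_transpose(a, block_sizes):
    """Return the transpose of the square matrix `a` (a list of lists)."""
    n = len(a)
    top = block_sizes[-1]
    if n % top != 0:
        raise ValueError("matrix side must be a multiple of the largest block")
    b = [[None] * n for _ in range(n)]
    for r in range(0, n, top):
        for c in range(0, n, top):
            rec_trans(a, b, r, c, len(block_sizes) - 1, block_sizes)
    return b
```

Calling `blocked_transpose(a, [2, 4])` on an 8 x 8 matrix visits 4 x 4 tiles, each of which is split into 2 x 2 tiles before any element is moved.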

In the following subsections we analyze the data movement of this algorithm to bound the number of 
cache misses at various levels. 

5.2 Moving a submatrix to contiguous locations 

To move a submatrix we will move it cache line by cache line. By choice of size of submatrices ($B_k \times B_k$)
each row will be an array of size $B_k$, but the rows themselves may be far apart.

Lemma 5.1 If two memory blocks x and y of size $B_k$ that are aligned in $\mathcal{L}_k$-cache map to the same cache set in
$\mathcal{L}_i$-cache for some $1 \le i \le k$, then x and y map to the same set in each $\mathcal{L}_j$-cache for all $1 \le j \le i$.

Proof: If x and y map to the same cache set in $\mathcal{L}_i$ cache then their i-th level memory block numbers (to
be denoted by $b^i(x)$ and $b^i(y)$) differ by a multiple of $L_i$. Let $b^i(x) - b^i(y) = \alpha L_i$. Since $L_j \mid L_i$ (both are
powers of two), $b^i(x) - b^i(y) = \beta L_j$ where $\beta = \alpha \cdot L_i/L_j$. Let x', y' be the corresponding sub-blocks of
x and y at the j-th level. Then their block numbers $b^j(x')$, $b^j(y')$ differ by $B_i/B_j \cdot \beta \cdot L_j$, i.e., a multiple of
$L_j$, as $B_j \mid B_i$. Note that blocks are aligned across different levels of cache. Therefore x and y also collide in
$\mathcal{L}_j$. □
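
A brute-force check of Lemma 5.1 on small parameters can be sketched as follows. The helper names `cache_set` and `lemma_holds` are hypothetical, and the set of a block is assumed to be its block number modulo the number of lines, as in a direct-mapped cache; this only tests instances and is not a substitute for the proof.

```python
def cache_set(addr, block_size, num_lines):
    """Set index of the block containing addr in a direct-mapped cache."""
    return (addr // block_size) % num_lines


def lemma_holds(levels, num_blocks):
    """Check Lemma 5.1 for all pairs of aligned blocks of the largest size.

    levels is a list of (B_i, L_i) pairs, fastest cache first.
    """
    top_block = levels[-1][0]
    starts = [n * top_block for n in range(num_blocks)]
    for x in starts:
        for y in starts:
            same = [cache_set(x, b, l) == cache_set(y, b, l) for b, l in levels]
            # A collision at level i must imply collisions at all levels j < i.
            for i, hit in enumerate(same):
                if hit and not all(same[:i]):
                    return False
    return True
```

With levels (4, 16), (8, 32), (16, 64), which satisfy assumptions A1 to A3, the check passes; a hierarchy whose line counts shrink with the level, such as (4, 64), (16, 4), violates A2 and fails it.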
Corollary 5.1 If two blocks of size $B_k$ that are aligned in $\mathcal{L}_k$-cache do not conflict in level i, they do not
conflict in any level j for all $i \le j \le k$.

Theorem 5.2 There is an algorithm which moves a set of blocks of size $B_k$ (where there are k levels of
cache with block size $B_i$ for each $1 \le i \le k$) into a contiguous area in main memory in

$$3 \sum_{i=1}^{k} \frac{N}{B_i}\, l_i$$

where N is the total data moved and $l_i$ is the cost of a cache miss for the i-th level of cache.
Proof: Let the set of blocks of size $B_k$ be I (we are assuming that the blocks are aligned). Let the target
block in the contiguous area for each block $i \in I$ be in the corresponding set J, where each block $j \in J$ is
also aligned with a cache line in $\mathcal{L}_k$ cache.

Let block a map to $R_{b,a}$, $b = \{1, 2, \ldots, k\}$, where $R_{b,a}$ denotes the set of cache lines in the $\mathcal{L}_b$-cache.
(Since a is of size $B_k$, it will occupy several blocks in lower levels of cache.)

Let the i-th block map to set $R_{k,i}$ of the $\mathcal{L}_k$ cache. Let the target block j map to set $R_{k,j}$. In the worst
case, $R_{k,j}$ is equal to $R_{k,i}$. Thus in this case the line has to be moved to a temporary block, say x
(mapped to $R_{k,x}$), and then moved back to $R_{k,j}$. We choose x such that $R_{1,x}$ and $R_{1,i}$ do not conflict and
also $R_{1,x}$ and $R_{1,j}$ do not conflict. Such a choice of x is always possible because our temporary storage area
X of size $4B_k$ has at least 4 lines of $\mathcal{L}_k$-cache (i and j will take up two blocks of $\mathcal{L}_k$-cache, thus leaving at
least one block free to be used as temporary storage). This is why we have the assumption that $4B_k \le B_1 L_1$.
That is, by dividing the $\mathcal{L}_1$-cache into $B_1 L_1/B_k$ zones, there is always a zone free for x.

For convenience of analysis, we maintain the invariant that X is always in $\mathcal{L}_k$-cache. By application of
the previous corollary on our choice of x (such that $R_{1,i} \ne R_{1,x} \ne R_{1,j}$) we also have $R_{a,i} \ne R_{a,x} \ne R_{a,j}$
for all $1 \le a \le k$. Thus we can move i to x and x to j without any conflict misses. The number of cache
misses involved is three for each level: one for getting the i-th block, one for writing the j-th block, and one
to maintain the invariant, since we have to touch the line displaced by i. Thus we get a factor of 3.

Thus the cost of this process is

$$3 \sum_{i=1}^{k} \frac{N}{B_i}\, l_i$$

where N is the amount of data moved.

□
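
The choice of the temporary block x in the proof can be sketched in Python as follows. The names `zone` and `pick_temp_block` are hypothetical; the sketch assumes that all addresses are aligned multiples of $B_k$, that X consists of four consecutive aligned $B_k$-blocks starting at `x_base`, and that an aligned $B_k$-block occupies exactly one of the $B_1 L_1/B_k$ zones of the $\mathcal{L}_1$-cache.

```python
def zone(addr, b1, l1, bk):
    """Zone of the L_1 cache occupied by the aligned B_k-block at addr."""
    num_zones = (b1 * l1) // bk
    return (addr // bk) % num_zones


def pick_temp_block(src, dst, x_base, b1, l1, bk):
    """Return the start of a block in X whose L_1 zone differs from
    the zones of the source block src and the target block dst."""
    busy = {zone(src, b1, l1, bk), zone(dst, b1, l1, bk)}
    for n in range(4):
        cand = x_base + n * bk
        if zone(cand, b1, l1, bk) not in busy:
            return cand
    raise ValueError("no free zone; is 4 * bk <= b1 * l1?")
```

When $4B_k \le B_1 L_1$ there are at least four zones, the four blocks of X fall into four distinct zones, and at most two of them are taken by src and dst, so the loop always finds a candidate.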

Remark 8 For blocks in I that are not aligned in $\mathcal{L}_k$ cache, the constant would increase to 4 since we would
need to bring up to 2 cache lines for each $i \in I$. The rest of the proof would remain the same.

Corollary 5.3 A $B_k \times B_k$ submatrix can be moved into contiguous locations in the memory in
$O\left(\sum_{i=1}^{k} \frac{B_k^2}{B_i}\, l_i\right)$ time in a computer that has k levels of (direct-mapped) cache.

This follows from the preceding discussion. We allocate memory say C of size Bk x Bk for placing the 
submatrix and memory, say, X of size 4Bk for temporary storage and keep both these areas distinct. 

Remark 9 If we have set associativity ($\ge 2$) in all levels of cache then we do not need an intermediate
buffer x, as lines i and j can both reside in cache simultaneously and movement from one to the other will
not cause thrashing. Thus the constant will come down to two. Since at any point in time we will only be
dealing with two cache lines, and will not need the lines i or j once we have read or written to them, the
replacement policy of the cache does not affect our algorithm.
Remark 10 If the capacity of the register file is greater than the size of the cache line ($B_k$) of the outermost
cache level ($\mathcal{L}_k$) then we can move data without worrying about collision by copying from line i to registers
and then from registers to line j. Thus even in this case the constant will come down to two.

Once we have the submatrices in contiguous locations we perform the transpose as follows. For each of
the submatrices we divide the $B_r \times B_r$ submatrix (say S) in level $\mathcal{L}_r$ (for $2 \le r \le k$) further into
$B_{r-1} \times B_{r-1}$ size submatrices as before. Each $B_{r-1} \times B_{r-1}$ size subsubmatrix fits into $\mathcal{L}_{r-1}$ cache completely
(since $B_{r-1} \cdot B_{r-1} \le B_{r-1} \cdot B_k \le B_{r-1} \cdot L_{r-1}$ from equation (1)). Let $B_r/B_{r-1} = k_r$.

Thus we have the submatrices as

$$S = \begin{pmatrix} S_{1,1} & S_{1,2} & \cdots & S_{1,k_r} \\ \vdots & & & \vdots \\ S_{k_r,1} & S_{k_r,2} & \cdots & S_{k_r,k_r} \end{pmatrix}$$

So we perform matrix transpose of each $S_{ij}$ in place without incurring any misses, as it resides completely
inside the cache. Once we have transposed each $S_{ij}$ we exchange $S_{ij}$ with $S_{ji}$. We will show that
$S_{ij}$ and $S_{ji}$ cannot conflict in $\mathcal{L}_{r-1}$-cache for $i \ne j$.

Since $S_{ij}$ and $S_{ji}$ lie in different parts of the $\mathcal{L}_r$-cache lines, they will map to different cache sets in the
$\mathcal{L}_{r-1}$-cache. The rows of $S_{ij}$ and $S_{ji}$ correspond to $(iB_{r-1} + a_1)k_r + j$ and $(jB_{r-1} + a_2)k_r + i$, where
$a_1, a_2 \in \{1, 2, \ldots, B_{r-1}\}$ and

$$B_r/B_{r-1} = k_r.$$

If these conflict then

$$(iB_{r-1} + a_1)k_r + j \equiv (jB_{r-1} + a_2)k_r + i \pmod{L_{r-1}}.$$

Since $B_{r-1} = 2^u$, $B_r = 2^v$ and $L_{r-1} = 2^w$ (all powers of two),

$$k_r = 2^{v-u}.$$

Therefore $k_r$ divides $L_{r-1}$ (because $k_r = B_r/B_{r-1} < B_r \le L_{r-1}$). Hence

$$j \equiv i \pmod{k_r}.$$

Since $i, j < k_r$ the above implies

$$i = j.$$
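
The congruence argument above can be checked exhaustively for small parameters with the following Python sketch. The function name `symmetric_tiles_conflict` is hypothetical, and the row indices are taken exactly as written in the text, with i and j ranging over 0, ..., $k_r - 1$.

```python
def symmetric_tiles_conflict(k_r, b_prev, l_prev):
    """Return True if some row of S_ij and some row of S_ji (i != j)
    map to the same set of a cache with l_prev lines.

    b_prev plays the role of B_{r-1} and l_prev that of L_{r-1}.
    """
    for i in range(k_r):
        for j in range(k_r):
            if i == j:
                continue
            for a1 in range(1, b_prev + 1):
                for a2 in range(1, b_prev + 1):
                    row_ij = (i * b_prev + a1) * k_r + j
                    row_ji = (j * b_prev + a2) * k_r + i
                    if row_ij % l_prev == row_ji % l_prev:
                        return True
    return False
```

When $k_r$ divides the number of lines, as the power-of-two assumptions guarantee, no pair conflicts; dropping that divisibility (for instance, three lines with $k_r = 4$) produces conflicts.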

Note that the $S_{ii}$'s do not have to be exchanged. Thus, we have shown that a $B_r \times B_r$ matrix can be
divided into $B_{r-1} \times B_{r-1}$ submatrices, each of which completely fits into $\mathcal{L}_{r-1}$-cache. Moreover, the
symmetric submatrices do not interfere with each other. The same argument can be extended to any $B_j \times B_j$
submatrix for $j < r$. Applying this recursively we end up dividing the $B_k \times B_k$ size matrix in $\mathcal{L}_k$-cache
into $B_1 \times B_1$ sized submatrices in $\mathcal{L}_1$-cache, which can then be transposed and exchanged easily. From the
preceding discussion, the corresponding submatrices do not interfere in any level of the cache.

(Note that even though we keep subdividing the matrix at every cache level recursively and claim that
we then have the submatrices in cache and can take the transpose and exchange them, the actual movement,
i.e., transpose and exchange, happens only at the $\mathcal{L}_1$-cache level, where the submatrices are of size $B_1 \times B_1$.)

The time taken by this operation is

$$\sum_{i=1}^{k} \frac{N^2}{B_i}\, l_i.$$



This is because each $S_{ij}$ and $S_{ji}$ pair (such that $i \ne j$) has to be brought into $\mathcal{L}_{r-1}$ cache only once
for transposing and exchanging of $B_1 \times B_1$ submatrices. Similarly, at any level of cache, a block from the
matrix is brought in only once. The sequence of the recursive calls ensures that each cache line is used
completely as we move from submatrix to submatrix.

Figure 3: Positions of symmetric submatrices in Cache

Finally, we move the transposed symmetric submatrices of size $B_k \times B_k$ to their location in memory,
i.e., reverse the process of bringing in blocks of size $B_k$ from random locations to a contiguous block. This
procedure is exactly the same as in Theorem 5.2 in the previous section, which has the constant 3.



Remark 11 The above constant of 3 for writing back the matrix to an appropriate location depends on the 
assumption that we can keep the two symmetric submatrices of size Bk x Bk in contiguous locations at the 
same time. This would allow us to exchange the matrices during the write back stage. If we are restricted to 
a contiguous temporary space of size Bk x Bk only, then we will have to move the data twice, incurring the 
cost twice. 



Remark 12 Even though in the above analysis we have always assumed a square matrix of size N x N, the
algorithm works correctly without any change for transposing a matrix of size M x N if we are transposing
a matrix A and storing it in B. This is because the same analysis of subdividing into submatrices of size
$B_k \times B_k$ and transposing still holds. However, if we want to transpose an M x N matrix in place, then the
approach used here fails because the location to write back to would not be obvious.

Theorem 5.4 The algorithm for matrix transpose runs in

$$O\left(\sum_{i=1}^{k} \frac{N^2}{B_i}\, l_i\right) + O(N^2)$$

steps in a computer that has k levels of direct-mapped cache.

If we have temporary storage space of size $2B_k \times B_k + 4B_k$ and assume block alignment of all submatrices,
then the constant is 7. This includes 3 for initial movement to contiguous location, 1 for transposing
the symmetric submatrices of size $B_k \times B_k$ and 3 for writing back the transposed submatrix to its original
location. Note that the constant is independent of the number of levels of cache.

Remark 13 Even if we have set associativity ($\ge 2$) in any level of cache the analysis goes through as before
(though the constants will come down for data copying to contiguous locations). For the transposing and 
exchange of symmetric submatrices the set associativity will not come into play because we need a line only 
once in the cache and are using only 2 lines at a given time. So either LRU or even FIFO replacement policy 
would only evict a line that we have already finished using. 

5.3 Sorting in multiple levels 

We first consider a restriction of the model described above where data cannot be transferred simultaneously
across non-consecutive cache levels. We use $C_i$ to denote $\sum_{j=1}^{i} M_j$.

Theorem 5.5 The lower bound for sorting in the restricted multi-level cache model is

$$\Omega\left(N \log N + \sum_{i=1}^{k} l_i \cdot \frac{N}{B_i} \frac{\log N/B_i}{\log C_i/B_i}\right).$$

Proof: The proof of Aggarwal and Vitter can be modified to disregard block transfers that merely rearrange
data in the external memory. Then it can be applied separately to each cache level, noting that the data
transfers in the higher levels do not contribute for any given level. □

These lower bounds are in the same spirit as those of Vitter and Nodine [32] (for the S-UMH model)
and Savage [28], that is, the lower bounds do not capture the simultaneous interaction of the different levels.



If we remove this restriction, then the following can be proved along similar lines as Theorem p .4. 
Lemma 5.2 The lower bound for sorting in the multi-level cache model is 

□ 

This bound appears weak if k is large. To rectify this, we observe the following. Across each cache
boundary, the minimum number of I/Os follows from Aggarwal and Vitter's arguments. The difficulty arises
in the multi-level model as a block transfer in level i propagates in all levels $j < i$ although the block
sizes are different. The minimum number of I/Os from (the highest) level k remains unaffected, namely,
$\frac{N}{B_k} \frac{\log N/B_k}{\log C_k/B_k}$. For level $k-1$, we will subtract this number from the lower bound of
$\frac{N}{B_{k-1}} \frac{\log N/B_{k-1}}{\log C_{k-1}/B_{k-1}}$.

Continuing in this fashion, we obtain the following lower bound. 
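Theorem 5.6 The lower bound for sorting in the multi-level cache model is

$$\Omega\left(N \log N + \sum_{i=1}^{k} l_i \cdot \left( \frac{N}{B_i} \frac{\log N/B_i}{\log C_i/B_i} - \sum_{j=i+1}^{k} \frac{N}{B_j} \frac{\log N/B_j}{\log C_j/B_j} \right)\right).$$

□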



If we further assume that $\frac{C_{i+1}}{C_i} \ge \frac{B_{i+1}}{B_i} \ge 3$, we obtain a relatively simple expression that resembles
Theorem 5.5. Note that the consecutive terms in the expression in the second summation of the previous
lemma decrease by a factor of 3.



Corollary 5.7 The lower bound for sorting in the multi-level cache model with geometrically decreasing
cache sizes and cache lines is $\Omega\left(N \log N + \frac{1}{2} \sum_{i=1}^{k} l_i \cdot \frac{N}{B_i} \frac{\log N/B_i}{\log C_i/B_i}\right)$. □



Theorem 5.8 In a multi-level cache, where the $B_i$ blocks are composed of $B_{i-1}$ blocks, we can sort in
expected time $O\left(N \log N + \frac{\log N/B_1}{\log M_1/B_1} \cdot \sum_{i=1}^{k} l_i \cdot \frac{N}{B_i}\right)$.

Proof: We perform an $M_1/B_1$-way mergesort using the variation proposed by Barve et al. [7] in the context
of parallel disk I/Os. The main idea is to shift each sorted stream cyclically by a random amount $R_i$ for the
ith stream. If Ri £ [0, — 1], then the leading element is in any of the cache sets with equal likelihood. 
Like Barve et al. [7], we divide the merging into phases where a phase outputs m elements, where m is the
merge degree. In the previous section we counted the number of conflict misses for the input streams, since 
we could exploit symmetry based on the random input. It is difficult to extend the previous arguments to a 
worst case input. However, it can be shown easily that if ^ < (where s is the number of cache sets), the 
expected number of conflict misses is O(1) in each phase. So the total expected number of cache misses is
$O(N/B_i)$ in the level i cache for all $1 \le i \le k$.

The cost of writing a block of size $B_1$ from level k is spread across several levels. The cost of transferring
$B_k/B_1$ blocks of size $B_1$ from level k is

$$l_k + l_{k-1} \frac{B_k}{B_{k-1}} + l_{k-2} \frac{B_k}{B_{k-2}} + \cdots + l_1 \frac{B_k}{B_1}.$$

Amortizing this cost over $B_k/B_1$ transfers gives us the required result. Recall that
$O\left(\frac{N}{B_1} \frac{\log N/B_1}{\log M_1/B_1}\right)$ $B_1$ block transfers suffice for $(M_1/B_1)^{1/3}$-way mergesort. □
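
The amortization step can be made concrete with a small Python sketch. The function name `cost_per_b1_block` is an illustrative assumption; it takes the latency factors $l_1, \ldots, l_k$ and block sizes $B_1, \ldots, B_k$ (each dividing the next) and returns the miss cost charged to one $B_1$-sized block.

```python
def cost_per_b1_block(latencies, block_sizes):
    """Amortized miss cost of moving one B_1-sized block through all levels.

    latencies[i] and block_sizes[i] describe cache level i + 1.
    """
    b1, bk = block_sizes[0], block_sizes[-1]
    # Cost of bringing one B_k block from level k:
    # l_k + l_{k-1} * B_k / B_{k-1} + ... + l_1 * B_k / B_1.
    total = sum(l * (bk // b) for l, b in zip(latencies, block_sizes))
    # That B_k block contains B_k / B_1 blocks of size B_1.
    return total / (bk // b1)
```

The result equals $\sum_i l_i B_1/B_i$; multiplying it by the number of $B_1$ block transfers gives the sum $\sum_i l_i N/B_i$ per merge pass that appears in Theorem 5.8.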
Remark 14 This bound is reasonably close to that of Corollary 5.7 if we ignore constant factors. Extending
this to the more general emulation scheme of Theorem 3.1 is not immediate, as we require the block transfers
across various cache boundaries to have a nice pattern, namely the sub-block property. This is satisfied by
mergesort, quicksort and a number of other algorithms, but cannot be assumed in general.



5.4 Cache-oblivious sorting 

In this section, we will focus on a two-level cache model that has limited associativity. One of the
cache-oblivious algorithms presented by Frigo et al. [18] is the funnel sort algorithm. They showed that the
algorithm is optimal in the I/O model (which is fully associative). However, it is not clear whether the
optimality holds in the cache model. In this section, we show that, with some simple modification, funnel
sort is optimal even in the direct-mapped cache model.
The funnel sort algorithm can be described as follows.

• Split the input into $n^{1/3}$ contiguous arrays of size $n^{2/3}$ and sort these arrays recursively.

• Merge the $n^{1/3}$ sorted sequences using an $n^{1/3}$-merger, where a k-merger works as follows.

A k-merger operates by recursively merging sorted sequences. Unlike mergesort, a k-merger stops
working on a merging sub-problem when the merged output sequence becomes "long enough" and resumes
working on another merging sub-problem (see Figure 4).

INVARIANT The invocation of a k-merger outputs the first $k^3$ elements of the sorted sequence obtained
by merging the k input sequences.

Base Case k = 2 producing $k^3 = 8$ elements whenever invoked.

NOTE The intermediate buffers are twice the size of the output obtained by a $k^{1/2}$-merger.

To output $k^3$ elements, the $k^{1/2}$-merger is invoked $k^{3/2}$ times. Before each invocation the k-merger fills
each buffer that is less than half full so that every buffer has at least $k^{3/2}$ elements, the number of elements
to be merged in that invocation.
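
The top-level recursion of funnel sort can be sketched in Python as below. This is an outline only, under a loud assumption: the standard-library `heapq.merge` stands in for the k-merger, so the sketch reproduces the splitting into roughly $n^{1/3}$ runs of size roughly $n^{2/3}$ but none of the buffer management that gives the k-merger its cache behavior. The name `funnel_sort_outline` and the cutoff of 8 elements are illustrative choices.

```python
import heapq


def funnel_sort_outline(a):
    """Sort a list using the split pattern of funnel sort.

    heapq.merge is a placeholder for the n^(1/3)-merger.
    """
    n = len(a)
    if n <= 8:
        return sorted(a)
    # Runs of size about n^(2/3), hence about n^(1/3) runs.
    run = max(1, round(n ** (2 / 3)))
    runs = [funnel_sort_outline(a[i:i + run]) for i in range(0, n, run)]
    return list(heapq.merge(*runs))
```

Replacing the call to `heapq.merge` with a recursively defined k-merger that uses the intermediate buffers described above would be the step that turns this outline into the actual algorithm.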



Frigo et al. [18] have shown that the above algorithm (that does not make explicit use of the various
memory-size parameters) is optimal in the I/O model. However, the I/O model does not account for conflict
misses since it assumes full associativity. This could be a degrading influence in the presence of limited
associativity (in particular direct-mapping).



5.4.1 Structure of k-merger

It is sufficient to get a bound on conflict misses in the cache model since the bounds for capacity misses in the
cache model are the same as the bounds shown in the I/O model.

Let us get an idea of what the structure of a k-merger looks like by looking at a 16-merger (see Figure 5).
A k-merger, unrolled, consists of 2-mergers arranged in a tree-like fashion. Since the number of 2-mergers
gets halved at each level and the initial input sequences are k in number, there are lg k levels.

Lemma 5.3 If the buffers are randomly placed and the starting position is also randomly chosen (since the 
buffers are cyclic this is easy to do) the probability of conflict misses is maximized if the buffers are less than 
one cache line long. 

The worst case for conflict misses occurs when the buffers are less than one cache line in size. This is
because if the buffers collide then all data that goes through them will thrash. If, however, the size of the
buffers were greater than one cache line, then even if some two elements collide, the probability of future
collisions would depend upon the data input or the relative movement of data in the two buffers. The
probability of conflict miss is maximized when the buffers are less than one cache line. Then the probability of
conflict is 1/m, where m is equal to the cache size M divided by the cache line size B, i.e., the number of
cache lines.

Figure 4: Recursive definition of a k-merger in terms of $k^{1/2}$-mergers

Figure 5: Expansion of a 16-merger into 2-mergers

5.4.2 Bounding conflict misses 

The analysis for compulsory and capacity misses goes through without change from the I/O model to the
cache model. Thus, funnel sort is optimal in the cache model if the conflict misses can be bounded by

$$\frac{N}{B} \times \frac{\log N/B}{\log M/B}.$$



Lemma 5.4 If the cache is 3-way or more set associative, there will be no conflict misses for a 2-way 
merger. 

Proof: The two input buffers and the output buffer, even if they map to the same cache set can reside 
simultaneously in the cache. Since at any stage only one 2-merger is active there will be no conflict misses 
at all and the cache misses will only be in the form of capacity or compulsory misses. □ 



5.4.3 Direct-Mapped case 

For an input of size N, an $N^{1/3}$-merger is created. The number of levels in such a merger is $\log N^{1/3}$ (i.e.,
the number of levels of the tree in the unrolled merger). Every element that travels through the $N^{1/3}$-merger
sees $\log N^{1/3}$ 2-mergers (see Figure 6). For an element passing through a 2-merger there are 3 buffers that
could collide. We charge an element for a conflict miss if it is swapped out of the cache before it passes
to the output buffer or collides with the output buffer when it is being output. So the expected number of
collisions is $\binom{3}{2}$ times the probability of collision between any two buffers (two input and one output). Thus
the expected number of collisions for a single element passing through a 2-merger is $\binom{3}{2} \times 1/m \le 3/m$,
where m = M/B.

If $X_{ij}$ is the indicator of a conflict miss for element i in level j, then summing over all elements and all
levels we get

$$E\left[\sum_{i=1}^{N} \sum_{j=1}^{\log N^{1/3}} X_{ij}\right] = \sum_{i=1}^{N} \sum_{j=1}^{\log N^{1/3}} E[X_{ij}]
\le \sum_{i=1}^{N} \sum_{j=1}^{\log N^{1/3}} \frac{3}{m} = \frac{3N}{m} \times \log N^{1/3}
= O\left(\frac{N}{m} \times \log N\right).$$
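
The bound just derived is easy to evaluate numerically. The following Python sketch does so; the function name `expected_conflict_bound` is hypothetical and logarithms are taken base 2, which is an assumption since the text leaves the base unspecified.

```python
import math


def expected_conflict_bound(n, cache_size, line_size):
    """Upper bound (3N/m) * log2(N^(1/3)) on expected conflict misses."""
    m = cache_size / line_size      # number of cache lines, m = M/B
    levels = math.log2(n) / 3       # log N^(1/3) levels of 2-mergers
    return n * levels * 3 / m
```

For N = 4096, M = 1024 and B = 4 there are 4 levels and m = 256 lines, so the bound evaluates to 3 * 4096 * 4 / 256 = 192 expected conflict misses.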



Lemma 5.5 The expected performance of funnel sort is optimal in the direct-mapped cache model if
$\log \frac{M}{B} \le \frac{M}{B^2 \log B}$. It is also optimal for a 3-way associative cache.
Proof: If M and B are such that

$$\log \frac{M}{B} \le \frac{M}{B^2 \log B},$$

we have the total number of conflict misses

$$\frac{N \log N}{m} = \frac{N \log N}{M/B} \le \frac{N \log N/B}{B \log M/B}.$$

Note that the condition is satisfied for $M \ge B^{2+\epsilon}$ for any fixed $\epsilon > 0$, which is similar to the tall-cache
assumption made by Frigo et al. [18].

The set associative case is proved by Lemma 5.4. □

Figure 6: A k-merger expanded out into 2-mergers
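
The condition of Lemma 5.5 can be tested for concrete cache parameters with a short Python sketch. The function name `funnel_condition_holds` is hypothetical, logarithms are assumed to be base 2, and B is assumed to be greater than 1.

```python
import math


def funnel_condition_holds(cache_size, line_size):
    """Check log(M/B) <= M / (B^2 log B) with base-2 logarithms."""
    m, b = cache_size, line_size
    return math.log2(m / b) <= m / (b * b * math.log2(b))
```

With M = 2^20 and B = 2^4 the left side is 16 and the right side is 1024, so the condition holds; with M = 2^10 and B = 2^4 the left side is 6 and the right side is 1, so it does not.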



The same analysis is applicable between successive levels $\mathcal{L}_i$ and $\mathcal{L}_{i+1}$ of a multi-level cache model.
This yields an optimal algorithm for sorting in the multilevel cache model.

Theorem 5.9 In a multi-level cache model, the number of cache misses at level $\mathcal{L}_i$ in the funnel sort
algorithm can be bounded by $O\left(\frac{N}{B_i} \frac{\log N/B_i}{\log M_i/B_i}\right)$.

This bound matches the lower bound of Lemma ^ within a constant factor, which makes it an optimal 
algorithm when simultaneous transfers are not allowed across multiple levels. 

6 Conclusions 

We have presented a cache model for designing and analyzing algorithms. Our model, while closely related 
to the I/O model of Aggarwal and Vitter, incorporates three additional salient features of cache: lower miss 
penalty, limited associativity, and lack of direct program control over data movement. We have established 
an emulation scheme that allows us to systematically convert an I/O-efficient algorithm into a cache-efficient 
algorithm. This emulation provides a generic starting point for cache-conscious algorithm design; it may 
be possible to further improve cache performance by problem-specific techniques to control interference 
misses. We have also demonstrated the relevance of the emulation scheme by demonstrating that a direct 
mapping of an I/O-efficient algorithm does not guarantee a cache-efficient algorithm. Finally, we have 
extended our basic cache model to multiple cache levels. 

Our single-level cache model is based on a blocking direct-mapped cache that does not distinguish 
between reads and writes. Modeling a non-blocking cache or distinguishing between reads and writes would 
appear to require queuing-theoretic extensions and does not appear to be appropriate at the algorithm design 
stage. The translation lookaside buffer or TLB is another important cache in real systems that caches virtual- 
to-physical address translations. Its peculiar aspect ratio and high miss penalty raise different concerns for 
algorithm design. Our preliminary experiments with certain permutation problems suggest that TLBs are
important to model and can contribute significantly to program running times. 

We have begun to implement some of these algorithms to validate the theory on real machines, and 
also using cache simulation tools like fast-cache, ATOM, or cprof. Preliminary observations indicate that 
our predictions are more accurate with respect to miss ratios than actual running times (see [12]). We have
traced a number of possible reasons for this. First, because the cache miss latencies are not astronomical, it 
is important to keep track of the constant factors. An algorithmic variation that guarantees lack of conflict 
misses at the expense of doubling the number of memory references may turn out to be slower than the 
original algorithm. Second, our preliminary experiments with certain permutation problems suggest that
TLBs are important to model and can contribute significantly to program running times. Third, several 
low-level details hidden by the compiler related to instruction scheduling, array address computations, and 
alignment of data structures in memory can significantly influence running times. As argued earlier, these 
factors are more appropriate to tackle at the level of implementation than algorithm design. 

Several of the cache problems we observe can be traced to the simple array layout schemes used in
current programming languages. It has been shown elsewhere [10, 11, 31] that nonlinear array layout schemes
based on quadrant-based decomposition are better suited for hierarchical memory systems. Further study of
such array layouts is a promising direction for future research.
A Approximating probability of conflict

Let $\mu$ be the number of elements between $s_{i,j}$ and $s_{i,j+1}$, i.e., one less than the difference in ranks of
$s_{i,j}$ and $s_{i,j+1}$. ($\mu$ may be 0, which guarantees event E1.) Let $E_m$ denote the event that $\mu = m$. Then
$\Pr[E1] = \sum_m \Pr[E1 \cap E_m]$, since the $E_m$'s are disjoint. For each m, $\Pr[E1 \cap E_m] = \Pr[E1|E_m] \cdot \Pr[E_m]$.
The events $E_m$ correspond to a geometric distribution, i.e.,

$$\Pr[E_m] = \Pr[\mu = m] = \frac{1}{k}\left(1 - \frac{1}{k}\right)^m. \qquad (2)$$

To compute $\Pr[E1|E_m]$, we further subdivide the event into cases about how the m numbers are
distributed into the sets $S_j$, $j \ne i$. Wlog, let i = 1 to keep notations simple. Let $m_2, \ldots, m_k$ denote the
case that $m_j$ numbers belong to sequence $S_j$ ($\sum_j m_j = m$). We need to estimate the probability that for
sequence $S_j$, $b_j$ does not conflict with $S(b_1)$ (recall that we have fixed i = 1) during the course that $m_j$
elements arrive in $S_j$. This can happen only if $S(b_j)$ (the cache set position of the leading block of $S_j$ right
after element $s_{i,j}$) does not lie roughly $\lceil m_j/B \rceil$ blocks from $S(b_1)$. From assumption A1 and some careful
counting this is $1 - \frac{m_j - 1 + B}{sB}$ for $m_j \ge 1$. For $m_j = 0$, this probability is 1 since no elements go into $S_j$
and hence there is no conflict (see footnote 4). These events are independent from our assumption A1 and hence
these can be multiplied. The probability for a fixed partition $m_2, \ldots, m_k$ is the multinomial
$\binom{m}{m_2 \cdots m_k} \left(\frac{1}{k-1}\right)^m$ (m is partitioned into k - 1 parts). Therefore we can write the following
expression for $\Pr[E1|E_m]$:

$$\Pr[E1|E_m] = \sum_{m_2 + \cdots + m_k = m} \binom{m}{m_2 \cdots m_k} \left(\frac{1}{k-1}\right)^m \prod_{m_j \ne 0} \left(1 - \frac{m_j - 1 + B}{sB}\right). \qquad (3)$$

Footnote 4: The reader will soon realize that this case leads to some non-trivial calculations.
In the remainder of this section, we will obtain an upper bound on the right hand side of equation (3).
Let $nz(m_2, \ldots, m_k)$ denote the number of j's for which $m_j \ne 0$ (non-zero partitions). Then equation (3)
can be rewritten as the following inequality.

$$\Pr[E1|E_m] \le \sum_{m_2 + \cdots + m_k = m} \binom{m}{m_2 \cdots m_k} \left(\frac{1}{k-1}\right)^m \left(1 - \frac{1}{s}\right)^{nz(m_2, \ldots, m_k)} \qquad (4)$$



since $1 - \frac{m_j - 1 + B}{sB} \le 1 - \frac{1}{s}$ for $m_j \ge 1$. In other words, the right side is the expected value of
$\left(1 - \frac{1}{s}\right)^{NZ(m,k-1)}$, where NZ(m, k - 1) denotes the number of non-empty bins when m balls are thrown
into k - 1 bins. Using equation (2) and the preceding discussion, we can write down an upper bound for the
(unconditional) probability of E1 as

$$\sum_{m=0}^{\infty} \frac{1}{k}\left(1 - \frac{1}{k}\right)^m \cdot E\left[\left(1 - \frac{1}{s}\right)^{NZ(m,k-1)}\right]. \qquad (5)$$

We use known sharp concentration bounds for the occupancy problem to obtain the following
approximation for the expression (5) in terms of s and k.



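Theorem A.1 ([22]) Let r = m/n, and Y be the number of empty bins when m balls are thrown randomly
into n bins. Then

$$E[Y] = n\left(1 - \frac{1}{n}\right)^m \sim n e^{-r}$$

and for $\lambda > 0$

$$\Pr[|Y - E[Y]| \ge \lambda] \le 2 \exp\left(-\frac{\lambda^2 (n - 1/2)}{n^2 - E[Y]^2}\right).$$

□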



Corollary A.2 Let NZ be the number of non-empty bins when m balls are thrown into k bins. Then

$$E[NZ] = k(1 - e^{-m/k})$$

and

$$\Pr[|NZ - E[NZ]| \ge \alpha \sqrt{2k \log k}] \le 1/k^{\alpha}.$$

□

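So in equation (5), $E\left[\left(1 - \frac{1}{s}\right)^{NZ(m,k)}\right]$ can be bounded by

$$\frac{1}{k^{\alpha}}\left(1 - \frac{1}{s}\right) + \left(1 - \frac{1}{s}\right)^{k\left(1 - e^{-m/k} - \alpha\sqrt{2k \log k}/k\right)} \qquad (6)$$

for any $\alpha$ and $m \ge 1$.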



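Proof (of Lemma 4.1): We will split up the summation of (5) into two parts, namely, $m \le e/2 \cdot k$ and
$m > e/2 \cdot k$. One can obtain better approximations by refining the partitions, but our objective here is to
demonstrate the existence of $\epsilon$ and $\delta$ and not necessarily obtain the best values.

$$\sum_{m=0}^{\infty} \frac{1}{k}\left(1 - \frac{1}{k}\right)^m E\left[\left(1 - \frac{1}{s}\right)^{NZ(m,k-1)}\right]
= \sum_{m=0}^{ek/2} \frac{1}{k}\left(1 - \frac{1}{k}\right)^m E\left[\left(1 - \frac{1}{s}\right)^{NZ(m,k-1)}\right]
+ \sum_{m=ek/2+1}^{\infty} \frac{1}{k}\left(1 - \frac{1}{k}\right)^m E\left[\left(1 - \frac{1}{s}\right)^{NZ(m,k-1)}\right] \qquad (7)$$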
The first term can be upper bounded by

$$\sum_{m=0}^{ek/2} \frac{1}{k}\left(1 - \frac{1}{k}\right)^m,$$

which is $\sim 1 - \frac{1}{e^{e/2}} \sim 0.74$.

The second term can be bounded using equation (6) with $\alpha \ge 2$.
The first term of the previous equation is less than 1/k and the second term can be bounded by

$$\sum_{m=ek/2+1}^{\infty} \frac{1}{k}\left(1 - \frac{1}{k}\right)^m \left(1 - \frac{1}{s}\right)^{0.25k}$$

for sufficiently large k (k > 80 suffices). This can be bounded by $\sim 0.25 e^{-0.25k/s}$, so equation (8) can be
bounded by $1/k + 0.25 e^{-0.25k/s}$. Adding this to the first term of equation (7), we obtain an upper bound of
$0.75 + 0.25 e^{-0.25k/s}$ for k > 100. Subtracting this from 1 gives us $\frac{1 - e^{-0.25k/s}}{4}$, i.e., $\delta \ge \frac{1 - e^{-0.25k/s}}{4}$. □
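
The numerical constant used for the first term can be checked with a few lines of Python. The function name `first_term_bound` is hypothetical; it evaluates the truncated geometric sum directly, with the upper limit $ek/2$ rounded down to an integer.

```python
import math


def first_term_bound(k):
    """Sum of (1/k)(1 - 1/k)^m for m = 0, ..., floor(e * k / 2)."""
    upper = int(math.e * k / 2)
    return sum((1 / k) * (1 - 1 / k) ** m for m in range(upper + 1))
```

For large k the sum approaches $1 - e^{-e/2}$, which is roughly 0.74, in agreement with the value quoted in the proof.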